Framing the problem¶
The goal of this report is to provide real estate developers and investors with valuable insights to make informed decisions that can significantly impact economic growth, urban development, and community well-being.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as plotly
import plotly.express as px
plotly.offline.init_notebook_mode()
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
Getting the data¶
For this report, we will be using this dataset. Kaggle
A building permit is an official approval document issued by a governmental agency that allows you or your contractor to proceed with a construction or remodeling project on one's property. For more details go to https://www.thespruce.com/what-is-a-building-permit-1398344. Each city or county has its own office related to buildings, that can do multiple functions like issuing permits, inspecting buildings to enforce safety measures, modifying rules to accommodate needs of the growing population etc. For the city of San Francisco, permit issuing is taken care by www.sfdbi.org/
Why is this important: In the recent past, several posts and blogs highlighted that main discrepancy in demand and supply in real estate industry is due to delays in issuing building permits. Refer: https://www.trulia.com/blog/trends/elasticity-2016/ - Introduces concept of elasticity, and nice scatterplot of various cities. A good data story! https://biv.com/article/2014/11/city-building-permit-delays-costing-developers-tim
Content The data was downloaded for the dates ranging from Jan 1st, 2013 to Feb 25th, 2018 using the filter in San Francisco open data portal. This is the exact link: https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9/data There are 43 columns and close to 200k records in the downloaded version (kept here). Description is separately uploaded as dictionary.
df = pd.read_csv('./data/Building_Permits.csv')
df.head(5)
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\855207747.py:1: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.
| Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
| 1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
| 2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | NaN | Pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | NaN | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
| 3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | NaN | Pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | NaN | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
| 4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | NaN | Market | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |
5 rows × 43 columns
We can see a description of the columns below.
desc = pd.read_excel('./data/DataDictionaryBuildingPermit.xlsx')
desc
| Sl No | Column name | Description | |
|---|---|---|---|
| 0 | 1 | Permit Number | Number assigned while filing |
| 1 | 2 | Permit Type | Type of the permit represented numerically. |
| 2 | 3 | Permit Type Definition | Description of the Permit type, for example\n ... |
| 3 | 4 | Permit Creation Date | Date on which permit created, later than \nor ... |
| 4 | 5 | Block | Related to address |
| 5 | 6 | Lot | Related to address |
| 6 | 7 | Street Number | Related to address |
| 7 | 8 | Street Number Suffix | Related to address |
| 8 | 9 | Street Name | Related to address |
| 9 | 10 | Street Name Suffix | Related to address |
| 10 | 11 | Unit | Unit of a building |
| 11 | 12 | Unit suffix | Suffix if any, for the unit |
| 12 | 13 | Description | Details about purpose of the permit.\n Example... |
| 13 | 14 | Current Status | Current status of the permit application. |
| 14 | 15 | Current Status Date | Date at which current status was entered |
| 15 | 16 | Filed Date | Filed date for the permit |
| 16 | 17 | Issued Date | Issued date for the permit |
| 17 | 18 | Completed Date | The date on which project was completed, \napp... |
| 18 | 19 | First Construction Document Date | Date on which construction was documented |
| 19 | 20 | Structural Notification | Notification to meet some legal need, given or... |
| 20 | 21 | Number of Existing Stories | Number of existing stories in the building. \n... |
| 21 | 22 | Number of Proposed Stories | Number of proposed stories for the constructio... |
| 22 | 23 | Voluntary Soft-Story \nRetrofit | Soft story to meet earth quake regulations |
| 23 | 24 | Fire Only Permit | Fire hazard prevention related permit |
| 24 | 25 | Permit Expiration Date | Expiration date related to issued permit. |
| 25 | 26 | Estimated Cost | Initial estimation of the cost of the project |
| 26 | 27 | Revised Cost | Revised estimation of the cost of the project |
| 27 | 28 | Existing Use | Existing use of the building |
| 28 | 29 | Existing Units | Existing number of units |
| 29 | 30 | Proposed Use | Proposed use of the building |
| 30 | 31 | Proposed Units | Proposed number of units |
| 31 | 32 | Plansets | Plan representation indicating the general des... |
| 32 | 33 | TIDF Compliance | TIDF compliant or not, this is a new legal req... |
| 33 | 34 | Existing Construction Type | Construction type, existing,as categories \nre... |
| 34 | 35 | Existing Construction Type Description | Description of the above, for example, \nwood ... |
| 35 | 36 | Proposed Construction Type | Construction type, proposed, as categories\n r... |
| 36 | 37 | Proposed Construction Type Description | Description of the above |
| 37 | 38 | Site Permit | Permit for site |
| 38 | 39 | Supervisor District | Supervisor District to which the building loca... |
| 39 | 40 | Neighborhoods - Analysis Boundaries | Neighborhood to which the building location be... |
| 40 | 41 | Zipcode | Zipcode of building address |
| 41 | 42 | Location | Location in latitude, longitude pair. |
| 42 | 43 | Record ID | Some ID, not useful for this |
Data Preparation¶
Let us see the state of our dataset.
df.shape
(198900, 43)
df.isnull().sum()
Permit Number 0 Permit Type 0 Permit Type Definition 0 Permit Creation Date 0 Block 0 Lot 0 Street Number 0 Street Number Suffix 196684 Street Name 0 Street Suffix 2768 Unit 169421 Unit Suffix 196939 Description 290 Current Status 0 Current Status Date 0 Filed Date 0 Issued Date 14940 Completed Date 101709 First Construction Document Date 14946 Structural Notification 191978 Number of Existing Stories 42784 Number of Proposed Stories 42868 Voluntary Soft-Story Retrofit 198865 Fire Only Permit 180073 Permit Expiration Date 51880 Estimated Cost 38066 Revised Cost 6066 Existing Use 41114 Existing Units 51538 Proposed Use 42439 Proposed Units 50911 Plansets 37309 TIDF Compliance 198898 Existing Construction Type 43366 Existing Construction Type Description 43366 Proposed Construction Type 43162 Proposed Construction Type Description 43162 Site Permit 193541 Supervisor District 1717 Neighborhoods - Analysis Boundaries 1725 Zipcode 1716 Location 1700 Record ID 0 dtype: int64
That is a lot of missing values.
Let us start by converting any date column to the actual datetime format.
date_columns = ['Permit Creation Date', 'Current Status Date', 'Filed Date', 'Issued Date',
'Completed Date', 'First Construction Document Date', 'Permit Expiration Date']
df[date_columns] = df[date_columns].astype('datetime64[ns]')
For this report, we will try to analyze and predict how long it takes to get a permit for the building.
To do this, we will also need to create a new column in the dataset, called days_to_issue.
df['Days to Issue'] = (df['Issued Date'] - df['Filed Date']).dt.days
fire_only_permit_counts = df['Fire Only Permit'].value_counts()
structural_notification_counts = df['Structural Notification'].value_counts()
site_permit_counts = df['Site Permit'].value_counts()
print(fire_only_permit_counts)
print(structural_notification_counts)
print(site_permit_counts)
Fire Only Permit Y 18827 Name: count, dtype: int64 Structural Notification Y 6922 Name: count, dtype: int64 Site Permit Y 5359 Name: count, dtype: int64
We can see that in the columns above, there is a Y value count in some of the rows and the rest of the rows are empty, we will fill this empty rows with N or better still, change the values to True or False.
df['Fire Only Permit'] = df['Fire Only Permit'].map({'Y': True, np.nan: False})
df['Structural Notification'] = df['Structural Notification'].map({'Y': True, np.nan: False})
df['Site Permit'] = df['Site Permit'].map({'Y': True, np.nan: False})
print(df['Fire Only Permit'].value_counts())
print(df['Structural Notification'].value_counts())
print(df['Site Permit'].value_counts())
Fire Only Permit False 180073 True 18827 Name: count, dtype: int64 Structural Notification False 191978 True 6922 Name: count, dtype: int64 Site Permit False 193541 True 5359 Name: count, dtype: int64
Some of the columns are empty or almost empty so there is no way to remedy them so we will be dropping those columns.
columns_to_drop = ['TIDF Compliance', 'Voluntary Soft-Story Retrofit']
df.drop(columns=columns_to_drop, inplace=True)
Let us see how the dataset looks now with the preliminary cleaning done.
missing_values = df.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, hue=missing_values.index, palette="viridis", dodge=False, legend=False)
plt.xticks(rotation=90)
plt.title('Missing Values')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.show()
missing_values = df.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)
# Convert the missing values to a DataFrame for Plotly
missing_values_df = missing_values.reset_index()
missing_values_df.columns = ['Columns', 'Number of Missing Values']
# Create a bar plot using Plotly
fig = px.bar(missing_values_df, x='Columns', y='Number of Missing Values',
title='Missing Values',
labels={'Columns': 'Columns', 'Number of Missing Values': 'Number of Missing Values'},
color='Columns')
fig.update_layout(xaxis={'categoryorder':'total descending'}, xaxis_tickangle=-90, height=1000)
fig.show()
We can see that the Unit suffix, Street Number Suffix and Unit contain the most missing values but they aren't really relevant because not all houses or buildings have one and they don't affect the permit issuance.
# calculate the percentage of missing values for each column
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values = missing_values / df.shape[0] * 100
missing_values = missing_values.sort_values(ascending=False)
missing_values
Unit Suffix 99.014077 Street Number Suffix 98.885872 Unit 85.178984 Completed Date 51.135747 Permit Expiration Date 26.083459 Existing Units 25.911513 Proposed Units 25.596280 Existing Construction Type Description 21.802916 Existing Construction Type 21.802916 Proposed Construction Type Description 21.700352 Proposed Construction Type 21.700352 Number of Proposed Stories 21.552539 Number of Existing Stories 21.510307 Proposed Use 21.336853 Existing Use 20.670689 Estimated Cost 19.138260 Plansets 18.757667 First Construction Document Date 7.514329 Issued Date 7.511312 Days to Issue 7.511312 Revised Cost 3.049774 Street Suffix 1.391654 Neighborhoods - Analysis Boundaries 0.867270 Supervisor District 0.863248 Zipcode 0.862745 Location 0.854701 Description 0.145802 dtype: float64
# drop the following columns unit suffix, street suffix, street number suffix, and zip code
columns_to_drop = ['Unit Suffix', 'Unit', 'Street Suffix', 'Street Number Suffix']
df.drop(columns=columns_to_drop, inplace=True)
df
| Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Name | Description | Current Status | ... | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | Days to Issue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201505065519 | 4 | sign - erect | 2015-05-06 | 0326 | 023 | 140 | Ellis | ground fl facade: to erect illuminated, electr... | expired | ... | constr type 3 | NaN | NaN | False | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 | 187.0 |
| 1 | 201604195146 | 4 | sign - erect | 2016-04-19 | 0306 | 007 | 440 | Geary | remove (e) awning and associated signs. | issued | ... | constr type 3 | NaN | NaN | False | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 | 471.0 |
| 2 | 201605278609 | 3 | additions alterations or repairs | 2016-05-27 | 0595 | 203 | 1647 | Pacific | installation of separating wall | withdrawn | ... | constr type 1 | 1.0 | constr type 1 | False | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 | NaN |
| 3 | 201611072166 | 8 | otc alterations permit | 2016-11-07 | 0156 | 011 | 1230 | Pacific | repair dryrot & stucco at front of bldg. | complete | ... | wood frame (5) | 5.0 | wood frame (5) | False | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 | 253.0 |
| 4 | 201611283529 | 6 | demolitions | 2016-11-28 | 0342 | 001 | 950 | Market | demolish retail/office/commercial 3-story buil... | issued | ... | constr type 3 | NaN | NaN | False | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 | 368.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 198895 | M862628 | 8 | otc alterations permit | 2017-12-05 | 0113 | 017A | 1228 | Montgomery | street space | issued | ... | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | 1489337276729 | 0.0 |
| 198896 | 201712055595 | 8 | otc alterations permit | 2017-12-05 | 0271 | 014 | 580 | Bush | fire alarm upgrade ref 201704123852 | issued | ... | wood frame (5) | 5.0 | wood frame (5) | False | NaN | NaN | NaN | NaN | 1489462354993 | 1.0 |
| 198897 | M863507 | 8 | otc alterations permit | 2017-12-06 | 4318 | 019 | 1568 | Indiana | street space | issued | ... | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | 1489539379952 | 0.0 |
| 198898 | M863747 | 8 | otc alterations permit | 2017-12-06 | 0298 | 029 | 795 | Sutter | street space permit | issued | ... | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | 1489608233656 | 0.0 |
| 198899 | M864287 | 8 | otc alterations permit | 2017-12-07 | 0160 | 006 | 838 | Pacific | street space permit | issued | ... | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | 1489796283803 | 0.0 |
198900 rows × 38 columns
df.isnull().sum()
Permit Number 0 Permit Type 0 Permit Type Definition 0 Permit Creation Date 0 Block 0 Lot 0 Street Number 0 Street Name 0 Description 290 Current Status 0 Current Status Date 0 Filed Date 0 Issued Date 14940 Completed Date 101709 First Construction Document Date 14946 Structural Notification 0 Number of Existing Stories 42784 Number of Proposed Stories 42868 Fire Only Permit 0 Permit Expiration Date 51880 Estimated Cost 38066 Revised Cost 6066 Existing Use 41114 Existing Units 51538 Proposed Use 42439 Proposed Units 50911 Plansets 37309 Existing Construction Type 43366 Existing Construction Type Description 43366 Proposed Construction Type 43162 Proposed Construction Type Description 43162 Site Permit 0 Supervisor District 1717 Neighborhoods - Analysis Boundaries 1725 Zipcode 1716 Location 1700 Record ID 0 Days to Issue 14940 dtype: int64
#get me the rows with missing values in the issued date column
missing_issued_date = df[df['Issued Date'].isnull()]
# show the distribution of the current status column
plt.figure(figsize=(12, 6))
sns.countplot(data=missing_issued_date, x='Current Status', palette='viridis')
plt.xticks(rotation=45)
plt.title('Current Status Distribution')
plt.xlabel('Current Status')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\385284598.py:3: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
# distribution of current status column
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Current Status', palette='viridis')
plt.xticks(rotation=45)
plt.title('Current Status Distribution')
plt.xlabel('Current Status')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\2324579989.py:3: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
# unique count of the current status column
df['Current Status'].value_counts()
Current Status complete 97077 issued 83559 filed 12043 withdrawn 1754 cancelled 1536 expired 1370 approved 733 reinstated 563 suspend 193 revoked 50 plancheck 16 appeal 2 disapproved 2 incomplete 2 Name: count, dtype: int64
# give me a dataframe with current status = cancelled or suspend or revoked
cancelled_suspended_revoked = df[(df['Current Status'] == 'cancelled') | (df['Current Status'] == 'suspend') | (df['Current Status'] == 'revoked')]
cancelled_suspended_revoked.head(5)
| Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Name | Description | Current Status | ... | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | Days to Issue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | 201303222819 | 7 | wall or painted sign | 2013-03-22 | 0719 | 002 | 901 | Van Ness | to erect electric, projecting, double faced si... | cancelled | ... | NaN | NaN | NaN | False | 5.0 | Western Addition | 94109.0 | (37.78400192377985, -122.42169433153967) | 129957380462 | NaN |
| 93 | 201501206148 | 3 | additions alterations or repairs | 2015-01-20 | 1226 | 019 | 418 | Shrader | rear structure - converting structure at the r... | cancelled | ... | wood frame (5) | 5.0 | wood frame (5) | False | 5.0 | Haight Ashbury | 94117.0 | (37.770762127562946, -122.45184046838122) | 1370343502813 | NaN |
| 100 | 201603313579 | 3 | additions alterations or repairs | 2016-03-31 | 1289 | 020 | 1591 | Shrader | building addition deck & balcony at (e) 3-stor... | cancelled | ... | wood frame (5) | 5.0 | wood frame (5) | True | 5.0 | Inner Sunset | 94117.0 | (37.760086498456246, -122.45055881787229) | 141801095518 | NaN |
| 125 | 201705035616 | 8 | otc alterations permit | 2017-05-03 | 4150 | 021 | 1061 | Florida | reroofing | cancelled | ... | wood frame (5) | 5.0 | wood frame (5) | False | 9.0 | Mission | 94110.0 | (37.754907619215444, -122.40999547082743) | 1461797171313 | 0.0 |
| 571 | 201301027105 | 8 | otc alterations permit | 2013-01-02 | 1173 | 007 | 40 | Loyola | upgrade existing furnace room by installing 5/... | cancelled | ... | wood frame (5) | 5.0 | wood frame (5) | False | 1.0 | Lone Mountain/USF | 94117.0 | (37.776229196787504, -122.44877581259863) | 129220091307 | 0.0 |
5 rows × 38 columns
# show the construction type distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=cancelled_suspended_revoked, x='Permit Type Definition', palette='viridis')
plt.xticks(rotation=45)
plt.title('Construction Type Distribution')
plt.xlabel('Construction Type')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\3381986286.py:3: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
Our most important columns here seem to be Issued Date and Days to Issue, which 7% of their values are missing. We will remove all the rows with missing values from these columns.
# remove rows with missing values in Issued Date and Days to Issue columns
df.dropna(subset=['Issued Date', 'Days to Issue'], inplace=True)
df.isnull().sum()
Permit Number 0 Permit Type 0 Permit Type Definition 0 Permit Creation Date 0 Block 0 Lot 0 Street Number 0 Street Name 0 Description 86 Current Status 0 Current Status Date 0 Filed Date 0 Issued Date 0 Completed Date 86788 First Construction Document Date 32 Structural Notification 0 Number of Existing Stories 40757 Number of Proposed Stories 40822 Fire Only Permit 0 Permit Expiration Date 36960 Estimated Cost 37065 Revised Cost 9 Existing Use 39187 Existing Units 48685 Proposed Use 40414 Proposed Units 48197 Plansets 36931 Existing Construction Type 41292 Existing Construction Type Description 41292 Proposed Construction Type 41084 Proposed Construction Type Description 41084 Site Permit 0 Supervisor District 1657 Neighborhoods - Analysis Boundaries 1664 Zipcode 1656 Location 1641 Record ID 0 Days to Issue 0 dtype: int64
For the rest columns, we will replace with the mode by grouping the columns.
import pandas as pd
def fill_missing_values_all_columns(df, group_columns):
df_copy = df.copy()
def fill_missing(x):
mode_values = x.mode()
filled_x = x.fillna(mode_values[0] if not mode_values.empty else x)
return filled_x.infer_objects(copy=False)
for target_column in df_copy.columns:
if target_column not in group_columns:
df_copy[target_column] = df_copy.groupby(group_columns)[target_column].transform(fill_missing)
return df_copy
# Optionally, set the Pandas option to opt-in to the future behavior
pd.set_option('future.no_silent_downcasting', True)
group_columns = ['Street Name', 'Block']
df_filled = fill_missing_values_all_columns(df, group_columns)
missing_values = df_filled.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)
print(missing_values)
Completed Date 4368 Existing Units 1252 Existing Construction Type 1204 Existing Construction Type Description 1204 Proposed Units 1145 Number of Existing Stories 1121 Proposed Construction Type Description 1035 Proposed Construction Type 1035 Number of Proposed Stories 982 Existing Use 957 Proposed Use 905 Estimated Cost 773 Permit Expiration Date 772 Plansets 772 Neighborhoods - Analysis Boundaries 101 Zipcode 94 Supervisor District 92 Location 89 Description 2 First Construction Document Date 1 Site Permit 0 Record ID 0 Permit Number 0 Revised Cost 0 Permit Type 0 Fire Only Permit 0 Structural Notification 0 Issued Date 0 Filed Date 0 Current Status Date 0 Current Status 0 Street Name 0 Street Number 0 Lot 0 Block 0 Permit Creation Date 0 Permit Type Definition 0 Days to Issue 0 dtype: int64
df_filled.isnull().sum()
Permit Number 0 Permit Type 0 Permit Type Definition 0 Permit Creation Date 0 Block 0 Lot 0 Street Number 0 Street Name 0 Description 2 Current Status 0 Current Status Date 0 Filed Date 0 Issued Date 0 Completed Date 4368 First Construction Document Date 1 Structural Notification 0 Number of Existing Stories 1121 Number of Proposed Stories 982 Fire Only Permit 0 Permit Expiration Date 772 Estimated Cost 773 Revised Cost 0 Existing Use 957 Existing Units 1252 Proposed Use 905 Proposed Units 1145 Plansets 772 Existing Construction Type 1204 Existing Construction Type Description 1204 Proposed Construction Type 1035 Proposed Construction Type Description 1035 Site Permit 0 Supervisor District 92 Neighborhoods - Analysis Boundaries 101 Zipcode 94 Location 89 Record ID 0 Days to Issue 0 dtype: int64
df = df_filled.copy()
Data Analysis¶
Permit Analysis¶
permit_type_def_counts = df['Permit Type Definition'].value_counts()
print(permit_type_def_counts)
Permit Type Definition otc alterations permit 170524 additions alterations or repairs 9439 sign - erect 2403 new construction wood frame 601 demolitions 377 wall or painted sign 362 new construction 176 grade or quarry or fill or excavate 78 Name: count, dtype: int64
Permit Type Definitions:
OTC Alterations Permit: Minor alterations or repairs processed quickly.
Additions Alterations or Repairs: Permits for additions, alterations, or repairs.
Sign - Erect: Permits for installation or erection of signs.
New Construction Wood Frame: Permits for new buildings with wood frame structure.
Demolitions: Permits for demolition of existing structures.
Wall or Painted Sign: Permits for wall signs or painted signs.
New Construction: Permits for entirely new construction projects.
Grade or Quarry or Fill or Excavate: Permits for grading, quarrying, filling, or excavating land.
plt.figure(figsize=(12, 6))
count_plot = sns.countplot(x='Permit Type Definition', data=df, order=df['Permit Type Definition'].value_counts().index)
plt.title('Permit Type Definition Counts')
plt.xlabel('Permit Type Definition')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
for p in count_plot.patches:
count_plot.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()
# Create a DataFrame with the counts
permit_counts = df['Permit Type Definition'].value_counts().reset_index()
permit_counts.columns = ['Permit Type Definition', 'Count']
# Create a bar plot using Plotly
fig = px.bar(permit_counts, x='Permit Type Definition', y='Count',
title='Permit Type Definition Counts',
labels={'Permit Type Definition': 'Permit Type Definition', 'Count': 'Count'},
text='Count')
# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=1000)
fig.update_traces(textposition='outside')
# Show the plot
fig.show()
# Calculate the mean days to issue for each permit type
mean_days_permit_type = df.groupby('Permit Type Definition')['Days to Issue'].mean().reset_index()
mean_days_permit_type = mean_days_permit_type.sort_values(by='Days to Issue')
# Create a bar plot using Plotly
fig = px.bar(mean_days_permit_type, x='Permit Type Definition', y='Days to Issue',
title='Average Days to Issue for Each Permit Type',
labels={'Permit Type Definition': 'Permit Type Definition', 'Days to Issue': 'Average Days to Issue'},
text='Days to Issue',
color='Days to Issue')
# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=800)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
# Show the plot
fig.show()
permit_type_counts = df['Permit Type Definition'].value_counts().reset_index()
permit_type_counts.columns = ['Permit Type Definition', 'Count']
mean_days_permit_type = df.groupby('Permit Type Definition')['Days to Issue'].mean().reset_index()
mean_days_permit_type = mean_days_permit_type.sort_values(by='Days to Issue')
mean_days_permit_type.columns = ['Permit Type Definition', 'Avg Days to Issue']
result_df = pd.merge(permit_type_counts, mean_days_permit_type, on='Permit Type Definition')
display(result_df)
| Permit Type Definition | Count | Avg Days to Issue | |
|---|---|---|---|
| 0 | otc alterations permit | 170524 | 10.870757 |
| 1 | additions alterations or repairs | 9439 | 247.252251 |
| 2 | sign - erect | 2403 | 52.724511 |
| 3 | new construction wood frame | 601 | 398.143095 |
| 4 | demolitions | 377 | 348.175066 |
| 5 | wall or painted sign | 362 | 45.348066 |
| 6 | new construction | 176 | 478.602273 |
| 7 | grade or quarry or fill or excavate | 78 | 97.307692 |
Days to Visit¶
How we can understand, which day of the week is the best for Department of Building Inspections visit?
We can look at how many days it took for considering Permit, depending on which day of the week the Permit was Filed
# Extract the weekday from the 'Filed Date' column
df['Weekday'] = df['Filed Date'].dt.day_name()
# Calculate the average days to issue for each weekday
average_days_per_weekday = df.groupby('Weekday')['Days to Issue'].mean().reset_index()
# Define the order of the weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Sort the DataFrame by the defined order
average_days_per_weekday['Weekday'] = pd.Categorical(average_days_per_weekday['Weekday'], categories=weekday_order, ordered=True)
average_days_per_weekday = average_days_per_weekday.sort_values('Weekday')
# Create a bar plot using Plotly
fig = px.bar(average_days_per_weekday, x='Weekday', y='Days to Issue',
title='Average Days to Issue for Each Day of the Week',
labels={'Weekday': 'Day of the Week', 'Days to Issue': 'Average Days to Issue'},
text='Days to Issue',
color='Weekday',
category_orders={'Weekday': weekday_order})
# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=800)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
# Show the plot
fig.show()
Further Analysis¶
Let us look
numeric_columns = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_columns.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
# Select relevant columns
df_rel = df[['Neighborhoods - Analysis Boundaries', 'Existing Construction Type Description', 'Permit Type']]
# Create a Heatmap for Neighborhood vs Construction Type vs Permit Type
plt.figure(figsize=(15, 8))
heatmap_data = pd.crosstab(index=df_rel['Neighborhoods - Analysis Boundaries'],
columns=df_rel['Existing Construction Type Description'],
values=df_rel['Permit Type'],
aggfunc='count',
normalize='index')
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt=".2f", cbar_kws={'label': 'Percentage of Permits'})
plt.title('Neighborhoods vs Construction Types vs Permit Types')
plt.xlabel('Construction Types')
plt.ylabel('Neighborhoods')
plt.show()
df['Permit Type'].value_counts()
Permit Type 8 170524 3 9439 4 2403 2 601 6 377 7 362 1 176 5 78 Name: count, dtype: int64
Prediction Model¶
Random Forest Regressor¶
#datatype of the permit type column
df['Permit Type']
dtype('int64')
from sklearn.ensemble import RandomForestRegressor
# Preparation and Modeling
df['Day_of_Week'] = df['Filed Date'].dt.dayofweek
df['Month'] = df['Filed Date'].dt.month
selected_features = [
'Number of Existing Stories', 'Estimated Cost', 'Day_of_Week', 'Month',
'Existing Construction Type', 'Revised Cost', 'Existing Units', 'Permit Type'
]
df_model = df[selected_features + ['Days to Issue']].dropna()
X = df_model.drop('Days to Issue', axis=1)
y = df_model['Days to Issue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
model_days_to_issue = RandomForestRegressor(random_state=40)
model_days_to_issue.fit(X_train, y_train)
RandomForestRegressor(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(random_state=40)
# Training and Evaluation
from sklearn.metrics import mean_squared_error, r2_score
y_pred_train = model_days_to_issue.predict(X_train)
y_pred_test = model_days_to_issue.predict(X_test)
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
print(f'Training Set - Mean Squared Error: {mse_train}, R-squared: {r2_train}')
print(f'Testing Set - Mean Squared Error: {mse_test}, R-squared: {r2_test}')
Training Set - Mean Squared Error: 648.598706950446, R-squared: 0.9208572781227299 Testing Set - Mean Squared Error: 3549.8507175360546, R-squared: 0.5613536531588268
Gradient Boosting Regressor¶
from sklearn.ensemble import GradientBoostingRegressor
model_gb = GradientBoostingRegressor(random_state=40)
model_gb.fit(X_train, y_train)
y_pred_train_gb = model_gb.predict(X_train)
y_pred_test_gb = model_gb.predict(X_test)
mse_train_gb = mean_squared_error(y_train, y_pred_train_gb)
r2_train_gb = r2_score(y_train, y_pred_train_gb)
mse_test_gb = mean_squared_error(y_test, y_pred_test_gb)
r2_test_gb = r2_score(y_test, y_pred_test_gb)
print(f'Gradient Boosting - Training Set - Mean Squared Error: {mse_train_gb}, R-squared: {r2_train_gb}')
print(f'Gradient Boosting - Testing Set - Mean Squared Error: {mse_test_gb}, R-squared: {r2_test_gb}')
Gradient Boosting - Training Set - Mean Squared Error: 3911.5567887013053, R-squared: 0.5227075729292396 Gradient Boosting - Testing Set - Mean Squared Error: 4013.9157931916643, R-squared: 0.5040102699209544
Support Vector Regressor (SVR)¶
from sklearn.svm import SVR
model_svr = SVR(kernel='rbf')
model_svr.fit(X_train, y_train)
y_pred_train_svr = model_svr.predict(X_train)
y_pred_test_svr = model_svr.predict(X_test)
mse_train_svr = mean_squared_error(y_train, y_pred_train_svr)
r2_train_svr = r2_score(y_train, y_pred_train_svr)
mse_test_svr = mean_squared_error(y_test, y_pred_test_svr)
r2_test_svr = r2_score(y_test, y_pred_test_svr)
print(f'SVR - Training Set - Mean Squared Error: {mse_train_svr}, R-squared: {r2_train_svr}')
print(f'SVR - Testing Set - Mean Squared Error: {mse_test_svr}, R-squared: {r2_test_svr}')
SVR - Training Set - Mean Squared Error: 8116.616614317737, R-squared: 0.009601585066902452 SVR - Testing Set - Mean Squared Error: 8038.314680830896, R-squared: 0.00672516957174496
XGBoost Regressor¶
import xgboost as xgb
model_xgb = xgb.XGBRegressor(random_state=40)
model_xgb.fit(X_train, y_train)
y_pred_train_xgb = model_xgb.predict(X_train)
y_pred_test_xgb = model_xgb.predict(X_test)
mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)
mse_test_xgb = mean_squared_error(y_test, y_pred_test_xgb)
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)
print(f'XGBoost - Training Set - Mean Squared Error: {mse_train_xgb}, R-squared: {r2_train_xgb}')
print(f'XGBoost - Testing Set - Mean Squared Error: {mse_test_xgb}, R-squared: {r2_test_xgb}')
XGBoost - Training Set - Mean Squared Error: 2764.717153027853, R-squared: 0.6626461965362527 XGBoost - Testing Set - Mean Squared Error: 3746.257006247557, R-squared: 0.5370842097666461
Conclusion¶
While we were able to get some alaysis from this dataset that will guide businesses in making decisions. The prediction on how long it takes a permit to get approved still needs some work.